AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.
We have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.
The objective is to predict whether a liability customer will buy a personal loan.
The data contains different attributes of AllLife Bank's liability customers. The detailed data dictionary is given below.
Data Dictionary
# this will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# Library to split data
from sklearn.model_selection import train_test_split
# To build model for prediction
from sklearn.linear_model import LogisticRegression
# Libraries to build decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
plot_confusion_matrix,  # note: removed in scikit-learn >= 1.2; use ConfusionMatrixDisplay on newer versions
precision_recall_curve,
roc_curve,
make_scorer,
)
# Sequential feature selector is present in mlxtend library
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
# to plot the performance with addition of each feature
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
The nb_black extension is already loaded. To reload it, use: %reload_ext nb_black
loan_data = pd.read_csv("./datasets/Loan_Modelling.csv")
# checking shape of the data
print(f"There are {loan_data.shape[0]} rows and {loan_data.shape[1]} columns.")
There are 5000 rows and 14 columns.
loan_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   ID                  5000 non-null   int64
 1   Age                 5000 non-null   int64
 2   Experience          5000 non-null   int64
 3   Income              5000 non-null   int64
 4   ZIPCode             5000 non-null   int64
 5   Family              5000 non-null   int64
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64
 8   Mortgage            5000 non-null   int64
 9   Personal_Loan       5000 non-null   int64
 10  Securities_Account  5000 non-null   int64
 11  CD_Account          5000 non-null   int64
 12  Online              5000 non-null   int64
 13  CreditCard          5000 non-null   int64
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
Observations: All 5000 entries are non-null, so there are no missing values. ZIPCode has been read as int64, but it is a categorical identifier, so we re-read the data with ZIPCode as a category.
# Read the csv into df variable
df = pd.read_csv("./datasets/Loan_Modelling.csv", dtype={"ZIPCode": "category"})
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   ID                  5000 non-null   int64
 1   Age                 5000 non-null   int64
 2   Experience          5000 non-null   int64
 3   Income              5000 non-null   int64
 4   ZIPCode             5000 non-null   category
 5   Family              5000 non-null   int64
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64
 8   Mortgage            5000 non-null   int64
 9   Personal_Loan       5000 non-null   int64
 10  Securities_Account  5000 non-null   int64
 11  CD_Account          5000 non-null   int64
 12  Online              5000 non-null   int64
 13  CreditCard          5000 non-null   int64
dtypes: category(1), float64(1), int64(12)
memory usage: 537.5 KB
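Reading ZIPCode as a category also saves memory whenever a column has far fewer distinct values than rows, because a categorical stores each distinct code once plus compact integer codes. A small self-contained sketch with made-up ZIP-like codes:

```python
import pandas as pd

# Hypothetical ZIP-code-like column: many repeated values, few distinct codes
zips_int = pd.Series([91107, 90089, 94720, 94112] * 1250)
zips_cat = zips_int.astype("category")

# The categorical version needs noticeably less memory than full int64 entries
print(zips_int.memory_usage(deep=True), zips_cat.memory_usage(deep=True))
```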
df.head()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
df.tail()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
# Let's view a sample of the data
df.sample(n=10, random_state=1)
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2764 | 2765 | 31 | 5 | 84 | 91320 | 1 | 2.9 | 3 | 105 | 0 | 0 | 0 | 0 | 1 |
| 4767 | 4768 | 35 | 9 | 45 | 90639 | 3 | 0.9 | 1 | 101 | 0 | 1 | 0 | 0 | 0 |
| 3814 | 3815 | 34 | 9 | 35 | 94304 | 3 | 1.3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3499 | 3500 | 49 | 23 | 114 | 94550 | 1 | 0.3 | 1 | 286 | 0 | 0 | 0 | 1 | 0 |
| 2735 | 2736 | 36 | 12 | 70 | 92131 | 3 | 2.6 | 2 | 165 | 0 | 0 | 0 | 1 | 0 |
| 3922 | 3923 | 31 | 4 | 20 | 95616 | 4 | 1.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2701 | 2702 | 50 | 26 | 55 | 94305 | 1 | 1.6 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1179 | 1180 | 36 | 11 | 98 | 90291 | 3 | 1.2 | 3 | 0 | 0 | 1 | 0 | 0 | 1 |
| 932 | 933 | 51 | 27 | 112 | 94720 | 3 | 1.8 | 2 | 0 | 0 | 1 | 1 | 1 | 1 |
| 792 | 793 | 41 | 16 | 98 | 93117 | 1 | 4.0 | 3 | 0 | 0 | 0 | 0 | 0 | 1 |
# checking for unique values in ID column
df["ID"].nunique()
5000
Observation:
The ID column is just a customer identifier and has all unique values, so it adds no value to the analysis; we will drop it. ZIPCode will need some processing before we can explore it: it has to be treated as a category rather than a number.
# drop the ID column as it does not add any value to the analysis
df.drop("ID", axis=1, inplace=True)
df.isnull().sum()
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
dtype: int64
df.duplicated().sum()
0
ZIPCode may look like a numerical value, but it is categorical. Converting zip codes to counties will make the data more intuitive.
df["ZIPCode"].nunique()
467
from uszipcode import SearchEngine
search = SearchEngine()
# let's define a function to convert the ZIPCode column to city
def zip_code_to_county(zipcode):
"""
This function takes in a zipcode and outputs the county it belongs to.
"""
zcode = search.by_zipcode(zipcode)
return zcode.county
# let's apply the function to the ZIPCode column
df["County"] = df["ZIPCode"].apply(zip_code_to_county)
counts = df.County.value_counts()
counts
Los Angeles County        1095
San Diego County           568
Santa Clara County         563
Alameda County             500
Orange County              339
San Francisco County       257
San Mateo County           204
Sacramento County          184
Santa Barbara County       154
Yolo County                130
Monterey County            128
Ventura County             114
San Bernardino County      101
Contra Costa County         85
Santa Cruz County           68
Riverside County            56
Kern County                 54
Marin County                54
Solano County               33
San Luis Obispo County      33
Humboldt County             32
Sonoma County               28
Fresno County               26
Placer County               24
Butte County                19
Shasta County               18
El Dorado County            17
Stanislaus County           15
San Benito County           14
San Joaquin County          13
Mendocino County             8
Siskiyou County              7
Tuolumne County              7
Merced County                4
Lake County                  4
Trinity County               4
Imperial County              3
Napa County                  3
Name: County, dtype: int64
Observation: The ZIP codes map to California counties, with Los Angeles County by far the most common. A few ZIP codes could not be matched to a county and produced missing County values, which we examine next.
df[df.County.isna()].County.value_counts(dropna=False)
NaN 34 Name: County, dtype: int64
df[df.County.isna()].ZIPCode.unique()
['92717', '93077', '92634', '96651'] Categories (4, object): ['92717', '93077', '92634', '96651']
df["County"] = df["County"].fillna("Others")  # assignment form avoids chained-inplace pitfalls
df.County.isnull().sum()
0
df.drop("ZIPCode", axis=1, inplace=True)
df.sample(n=10, random_state=5)
| | Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | County |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27 | 46 | 20 | 158 | 1 | 2.40 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | Los Angeles County |
| 1482 | 60 | 35 | 8 | 1 | 0.10 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | San Francisco County |
| 3021 | 54 | 28 | 159 | 2 | 0.50 | 1 | 461 | 0 | 0 | 0 | 1 | 0 | Los Angeles County |
| 3867 | 44 | 19 | 61 | 3 | 2.70 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | San Mateo County |
| 637 | 53 | 28 | 31 | 4 | 0.10 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | Los Angeles County |
| 4191 | 42 | 15 | 39 | 3 | 1.00 | 2 | 132 | 0 | 0 | 0 | 0 | 0 | Los Angeles County |
| 3042 | 52 | 26 | 78 | 3 | 3.00 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | Santa Clara County |
| 775 | 65 | 39 | 23 | 3 | 0.70 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | Orange County |
| 3767 | 40 | 16 | 83 | 4 | 2.67 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | Sacramento County |
| 3954 | 32 | 7 | 134 | 2 | 3.10 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Santa Barbara County |
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age | 5000.0 | 45.338400 | 11.463166 | 23.0 | 35.0 | 45.0 | 55.0 | 67.0 |
| Experience | 5000.0 | 20.104600 | 11.467954 | -3.0 | 10.0 | 20.0 | 30.0 | 43.0 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.0 | 64.0 | 98.0 | 224.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.7 | 1.5 | 2.5 | 10.0 |
| Education | 5000.0 | 1.881000 | 0.839869 | 1.0 | 1.0 | 2.0 | 3.0 | 3.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.0 | 0.0 | 101.0 | 635.0 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
Observations:
- Experience has a minimum of -3, which is not possible; these negative values need treatment.
- Income, CCAvg, and Mortgage are right-skewed (mean well above median); more than half of the customers have no mortgage.
- About 9.6% of customers accepted a personal loan, so the target is imbalanced.
# checking negative values in Experience
df.sort_values(by=["Experience"], ascending=True).head(5)
| | Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | County |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4514 | 24 | -3 | 41 | 4 | 1.0 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | Los Angeles County |
| 2618 | 23 | -3 | 55 | 3 | 2.4 | 2 | 145 | 0 | 0 | 0 | 1 | 0 | Orange County |
| 4285 | 23 | -3 | 149 | 2 | 7.2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | Kern County |
| 3626 | 24 | -3 | 28 | 4 | 1.0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | Los Angeles County |
| 3796 | 24 | -2 | 50 | 3 | 2.4 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | Marin County |
df[df["Experience"] < 0].Experience.count()
52
Observations: 52 rows have negative Experience, most plausibly a sign error at data entry; we replace these values with their absolute value.
df["Experience"] = df["Experience"].abs()
df.sort_values(by=["Experience"], ascending=True).head(5)
| | Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | County |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2756 | 27 | 0 | 40 | 4 | 1.0 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | Los Angeles County |
| 2009 | 25 | 0 | 99 | 1 | 1.9 | 1 | 323 | 0 | 0 | 0 | 0 | 0 | Orange County |
| 4393 | 24 | 0 | 59 | 4 | 1.6 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Humboldt County |
| 347 | 25 | 0 | 43 | 2 | 1.6 | 3 | 0 | 0 | 1 | 1 | 1 | 1 | Santa Clara County |
| 4425 | 26 | 0 | 164 | 2 | 4.0 | 3 | 301 | 1 | 0 | 0 | 1 | 0 | Butte County |
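Taking the absolute value assumes the sign was flipped at data entry. An alternative treatment, shown here only as a sketch and not applied above, is to clip negatives at zero, which assumes the true value is unknown but non-negative:

```python
import pandas as pd

# Toy Experience values, including the impossible negatives seen above
exp = pd.Series([-3, -2, 0, 10, 43])

# clip(lower=0) floors negatives at zero instead of flipping their sign
print(exp.clip(lower=0).tolist())  # [0, 0, 0, 10, 43]
```

Either choice affects only 52 of 5000 rows, so the modeling impact is small; abs() keeps more variation, clip() is the more conservative assumption.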
# checking extreme values in Income
df.sort_values(by=["Income"], ascending=False).head(10)
| | Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | County |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3896 | 48 | 24 | 224 | 2 | 6.67 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | Monterey County |
| 4993 | 45 | 21 | 218 | 2 | 6.67 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | Los Angeles County |
| 526 | 26 | 2 | 205 | 1 | 6.33 | 1 | 271 | 0 | 0 | 0 | 0 | 1 | Santa Barbara County |
| 2988 | 46 | 21 | 205 | 2 | 8.80 | 1 | 181 | 0 | 1 | 0 | 1 | 0 | El Dorado County |
| 4225 | 43 | 18 | 204 | 2 | 8.80 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | San Diego County |
| 677 | 46 | 21 | 204 | 2 | 2.80 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | Orange County |
| 2278 | 30 | 4 | 204 | 2 | 4.50 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | Los Angeles County |
| 3804 | 47 | 22 | 203 | 2 | 8.80 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | Sacramento County |
| 2101 | 35 | 5 | 203 | 1 | 10.00 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | Santa Clara County |
| 787 | 45 | 15 | 202 | 3 | 10.00 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | Los Angeles County |
Observations:
# checking extreme values in Credit Card Average
df.sort_values(by=["CCAvg"], ascending=False).head(10)
| | Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | County |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 787 | 45 | 15 | 202 | 3 | 10.0 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | Los Angeles County |
| 2101 | 35 | 5 | 203 | 1 | 10.0 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | Santa Clara County |
| 2337 | 43 | 16 | 201 | 1 | 10.0 | 2 | 0 | 1 | 0 | 0 | 0 | 1 | Santa Clara County |
| 3943 | 61 | 36 | 188 | 1 | 9.3 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | Ventura County |
| 3822 | 63 | 33 | 178 | 4 | 9.0 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | Los Angeles County |
| 1339 | 52 | 25 | 180 | 2 | 9.0 | 2 | 297 | 1 | 0 | 0 | 1 | 0 | Alameda County |
| 9 | 34 | 9 | 180 | 1 | 8.9 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | Ventura County |
| 2769 | 33 | 9 | 183 | 2 | 8.8 | 3 | 582 | 1 | 0 | 0 | 1 | 0 | Ventura County |
| 2447 | 44 | 19 | 201 | 2 | 8.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | Sacramento County |
| 917 | 45 | 20 | 200 | 2 | 8.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | Los Angeles County |
Observations:
# checking extreme values in Mortgage
df.sort_values(by=["Mortgage"], ascending=False).head(10)
| | Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | County |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2934 | 37 | 13 | 195 | 2 | 6.5 | 1 | 635 | 0 | 0 | 0 | 1 | 0 | San Bernardino County |
| 303 | 49 | 25 | 195 | 4 | 3.0 | 1 | 617 | 1 | 0 | 0 | 0 | 0 | Yolo County |
| 4812 | 29 | 4 | 184 | 4 | 2.2 | 3 | 612 | 1 | 0 | 0 | 1 | 0 | San Diego County |
| 1783 | 53 | 27 | 192 | 1 | 1.7 | 1 | 601 | 0 | 0 | 0 | 1 | 0 | Alameda County |
| 4842 | 49 | 23 | 174 | 3 | 4.6 | 2 | 590 | 1 | 0 | 0 | 0 | 0 | Mendocino County |
| 1937 | 51 | 25 | 181 | 1 | 3.3 | 3 | 589 | 1 | 1 | 1 | 1 | 0 | Santa Clara County |
| 782 | 54 | 30 | 194 | 3 | 6.0 | 3 | 587 | 1 | 1 | 1 | 1 | 1 | San Diego County |
| 2769 | 33 | 9 | 183 | 2 | 8.8 | 3 | 582 | 1 | 0 | 0 | 1 | 0 | Ventura County |
| 4655 | 33 | 7 | 188 | 2 | 7.0 | 2 | 581 | 1 | 0 | 0 | 0 | 0 | Santa Clara County |
| 4345 | 26 | 1 | 184 | 2 | 4.2 | 3 | 577 | 1 | 0 | 1 | 1 | 1 | Alameda County |
Observations:
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="lightpink"
) # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
histogram_boxplot(df, "Age", bins=30, kde=True)
Observations:
histogram_boxplot(df, "Experience", bins=40, kde=True)
Observations:
histogram_boxplot(df, "Income", kde=True)
Observations:
histogram_boxplot(df, "CCAvg", kde=True)
Observations:
histogram_boxplot(df, "Mortgage", kde=True)
Observations
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None, rotation=90):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    rotation: rotation angle for the x-axis tick labels (default is 90)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=rotation, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
        if perc:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
labeled_barplot(df, "Personal_Loan", perc=True)
labeled_barplot(df, "Family", perc=True, rotation=0)
Observations:
The below observations are made with the assumption that the customer is the head of the family.
labeled_barplot(df, "Education", perc=True, rotation=0)
Observations:
labeled_barplot(df, "County", perc=False)
Observations:
Zooming into this plot gives us the below information.
labeled_barplot(df, "Securities_Account", perc=True)
labeled_barplot(df, "CD_Account", perc=True)
labeled_barplot(df, "Online", perc=True)
labeled_barplot(df, "CreditCard", perc=True)
plt.figure(figsize=(15, 7))
sns.heatmap(df.select_dtypes(include=np.number).corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")  # restrict to numeric columns; newer pandas errors on non-numeric columns in corr()
plt.show()
Observations:
sns.pairplot(data=df, hue="Family", palette="bright")
plt.show()
Observations:
Zooming into these plots gives us the below information
Let's check the variation in Personal Loan with some of the other variables.
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 117)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))  # place the legend outside the plot
plt.show()
stacked_barplot(df, "Family", "Personal_Loan")
Personal_Loan     0    1   All
Family
All            4520  480  5000
4              1088  134  1222
3               877  133  1010
1              1365  107  1472
2              1190  106  1296
---------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Education", "Personal_Loan")
Personal_Loan     0    1   All
Education
All            4520  480  5000
3              1296  205  1501
2              1221  182  1403
1              2003   93  2096
---------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "County", "Personal_Loan")
Personal_Loan              0    1   All
County
All                     4520  480  5000
Los Angeles County       984  111  1095
Santa Clara County       492   71   563
San Diego County         509   59   568
Alameda County           456   44   500
Orange County            309   30   339
San Francisco County     238   19   257
Sacramento County        169   15   184
Monterey County          113   15   128
San Mateo County         192   12   204
Contra Costa County       73   12    85
Ventura County           103   11   114
Santa Barbara County     143   11   154
Yolo County              122    8   130
Santa Cruz County         60    8    68
Kern County               47    7    54
Marin County              48    6    54
Riverside County          50    6    56
Sonoma County             22    6    28
San Luis Obispo County    28    5    33
Shasta County             15    3    18
Others                    31    3    34
San Bernardino County     98    3   101
Solano County             30    3    33
Humboldt County           30    2    32
Butte County              17    2    19
Fresno County             24    2    26
Placer County             22    2    24
Stanislaus County         14    1    15
El Dorado County          16    1    17
San Joaquin County        12    1    13
Mendocino County           7    1     8
Siskiyou County            7    0     7
Imperial County            3    0     3
Napa County                3    0     3
Merced County              4    0     4
Trinity County             4    0     4
Tuolumne County            7    0     7
Lake County                4    0     4
San Benito County         14    0    14
---------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Securities_Account", "Personal_Loan")
Personal_Loan          0    1   All
Securities_Account
All                 4520  480  5000
0                   4058  420  4478
1                    462   60   522
---------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "CD_Account", "Personal_Loan")
Personal_Loan     0    1   All
CD_Account
All            4520  480  5000
0              4358  340  4698
1               162  140   302
---------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Online", "Personal_Loan")
Personal_Loan     0    1   All
Online
All            4520  480  5000
1              2693  291  2984
0              1827  189  2016
---------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "CreditCard", "Personal_Loan")
Personal_Loan     0    1   All
CreditCard
All            4520  480  5000
0              3193  337  3530
1              1327  143  1470
---------------------------------------------------------------------------------------------------------------------
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
    axs[0, 0].set_title("Distribution of " + predictor + " for " + target + "=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
stat="density",
)
    axs[0, 1].set_title("Distribution of " + predictor + " for " + target + "=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
stat="density",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
distribution_plot_wrt_target(df, "Age", "Personal_Loan")
distribution_plot_wrt_target(df, "Experience", "Personal_Loan")
distribution_plot_wrt_target(df, "Mortgage", "Personal_Loan")
distribution_plot_wrt_target(df, "Income", "Personal_Loan")
distribution_plot_wrt_target(df, "CCAvg", "Personal_Loan")
Logistic Regression
Decision Trees
The model can make wrong predictions in two ways:
1. Predicting that a customer will purchase the loan when they will not (a false positive).
2. Predicting that a customer will not purchase the loan when they would have (a false negative).

Which loss is greater?
Losing a potential loan customer (a false negative) is the greater loss for the bank, since it forgoes the interest income the campaign is meant to generate.

How do we reduce this loss, i.e., reduce false negatives?
The bank would want Recall to be maximized: the greater the recall, the smaller the chance of false negatives.
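Since false negatives are the costlier error here, model tuning should optimize recall rather than accuracy. A minimal sketch of wiring a recall scorer into a grid search, using synthetic stand-in data and an illustrative parameter grid (not the tuning run for the bank model):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the bank data (~10% positives)
X_demo, y_demo = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=1)

# make_scorer turns recall of class 1 into the grid-search selection criterion
scorer = make_scorer(recall_score)
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid={"max_depth": [3, 5, None]},  # illustrative values only
    scoring=scorer,
    cv=3,
)
grid.fit(X_demo, y_demo)
print(grid.best_params_, round(grid.best_score_, 3))
```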
X = df.drop(["Personal_Loan"], axis=1)
Y = df["Personal_Loan"]
# one-hot encoding the categorical variables
X = pd.get_dummies(X, drop_first=True)
X.head()
| | Age | Experience | Income | Family | CCAvg | Education | Mortgage | Securities_Account | CD_Account | Online | CreditCard | County_Butte County | County_Contra Costa County | County_El Dorado County | County_Fresno County | County_Humboldt County | County_Imperial County | County_Kern County | County_Lake County | County_Los Angeles County | County_Marin County | County_Mendocino County | County_Merced County | County_Monterey County | County_Napa County | County_Orange County | County_Others | County_Placer County | County_Riverside County | County_Sacramento County | County_San Benito County | County_San Bernardino County | County_San Diego County | County_San Francisco County | County_San Joaquin County | County_San Luis Obispo County | County_San Mateo County | County_Santa Barbara County | County_Santa Clara County | County_Santa Cruz County | County_Shasta County | County_Siskiyou County | County_Solano County | County_Sonoma County | County_Stanislaus County | County_Trinity County | County_Tuolumne County | County_Ventura County | County_Yolo County |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49 | 4 | 1.6 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 45 | 19 | 34 | 3 | 1.5 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 39 | 15 | 11 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 35 | 9 | 100 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 35 | 8 | 45 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
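`drop_first=True` keeps k-1 indicator columns for a k-level category, dropping the first level as the implicit baseline and avoiding perfectly collinear dummies (the "dummy-variable trap"). A toy illustration with made-up county names:

```python
import pandas as pd

# Hypothetical 3-level categorical column
toy = pd.DataFrame({"County": ["Alameda", "Kern", "Yolo", "Alameda"]})
dummies = pd.get_dummies(toy, drop_first=True)

# The first level alphabetically ("Alameda") becomes the baseline and gets no column
print(list(dummies.columns))  # ['County_Kern', 'County_Yolo']
```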
# splitting into training and test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
print("Number of rows in train data =", X_train.shape[0])
print("Number of rows in test data =", X_test.shape[0])
Number of rows in train data = 3500
Number of rows in test data = 1500
print("Percentage of classes in training set:")
print(Y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(Y_test.value_counts(normalize=True))
Percentage of classes in training set:
0    0.905429
1    0.094571
Name: Personal_Loan, dtype: float64
Percentage of classes in test set:
0    0.900667
1    0.099333
Name: Personal_Loan, dtype: float64
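The split above happens to preserve the ~9.6% positive rate in both sets; passing `stratify=Y` to `train_test_split` (not done above) would guarantee it. A sketch with synthetic labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic samples with a 10% positive rate
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)
# Stratification preserves the 10% positive rate exactly in both splits
print(y_tr.mean(), y_te.mean())  # 0.1 0.1
```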
First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn_with_threshold(
model, predictors, target, threshold=0.5
):
"""
Function to compute different metrics, based on the threshold specified, to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
# predicting using the independent variables
pred_prob = model.predict_proba(predictors)[:, 1]
    pred = (pred_prob > threshold).astype(int)  # classify as 1 when probability exceeds the threshold
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
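As a standalone sanity check of the thresholding idea, using toy probabilities rather than the bank model: lowering the threshold can only add predicted positives, which raises recall, typically at the cost of precision.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy ground truth and predicted probabilities (illustrative values)
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
pred_prob = np.array([0.2, 0.4, 0.35, 0.8, 0.6, 0.45, 0.45, 0.1])

for threshold in (0.5, 0.3):
    pred = (pred_prob > threshold).astype(int)
    print(threshold, recall_score(y_true, pred), precision_score(y_true, pred))
# At 0.5: recall 0.5, precision 1.0; at 0.3: recall 1.0, precision ~0.67
```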
# defining a function to plot the confusion_matrix of a classification model built using sklearn
def confusion_matrix_sklearn_with_threshold(model, predictors, target, threshold=0.5):
"""
To plot the confusion_matrix, based on the threshold specified, with percentages
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
pred_prob = model.predict_proba(predictors)[:, 1]
    y_pred = (pred_prob > threshold).astype(int)  # classify as 1 when probability exceeds the threshold
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
# There are different solvers available in Sklearn logistic regression
# newton-cg is one of several available solvers; it converges reliably on this dataset
lgModel = LogisticRegression(solver="newton-cg", random_state=1)
lgModel = lgModel.fit(X_train, Y_train)
# log_odds = lgModel.coef_[0]
coef_df = pd.DataFrame(
np.append(lgModel.coef_, lgModel.intercept_),
index=X_train.columns.tolist() + ["Intercept"],
columns=["Coefficients"],
)
coef_df.T
| | Age | Experience | Income | Family | CCAvg | Education | Mortgage | Securities_Account | CD_Account | Online | CreditCard | County_Butte County | County_Contra Costa County | County_El Dorado County | County_Fresno County | County_Humboldt County | County_Imperial County | County_Kern County | County_Lake County | County_Los Angeles County | County_Marin County | County_Mendocino County | County_Merced County | County_Monterey County | County_Napa County | County_Orange County | County_Others | County_Placer County | County_Riverside County | County_Sacramento County | County_San Benito County | County_San Bernardino County | County_San Diego County | County_San Francisco County | County_San Joaquin County | County_San Luis Obispo County | County_San Mateo County | County_Santa Barbara County | County_Santa Clara County | County_Santa Cruz County | County_Shasta County | County_Siskiyou County | County_Solano County | County_Sonoma County | County_Stanislaus County | County_Trinity County | County_Tuolumne County | County_Ventura County | County_Yolo County | Intercept |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Coefficients | -0.011384 | 0.016362 | 0.0532 | 0.728395 | 0.166959 | 1.670412 | 0.000828 | -0.86027 | 3.284203 | -0.553684 | -1.00049 | -0.25215 | 0.7158 | -0.230152 | -0.090541 | -0.325835 | -0.020083 | 0.581692 | -0.006199 | -0.025223 | 0.426369 | -0.049477 | -0.30033 | 0.033897 | -0.005566 | -0.105671 | 0.641605 | 0.651174 | 0.908893 | 0.039488 | -0.212848 | -0.772676 | 0.106292 | 0.277967 | 0.024412 | -0.418975 | -1.042777 | 0.106113 | 0.178439 | 0.236905 | -0.21257 | -0.032839 | 0.318638 | 0.501137 | -0.26914 | -0.106898 | -0.183659 | 0.204458 | -0.438134 | -13.24035 |
# converting coefficients to odds
odds = np.exp(lgModel.coef_[0])
# finding the percentage change
perc_change_odds = (np.exp(lgModel.coef_[0]) - 1) * 100
# adding the odds to a dataframe
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train.columns).T
| | Age | Experience | Income | Family | CCAvg | Education | Mortgage | Securities_Account | CD_Account | Online | CreditCard | County_Butte County | County_Contra Costa County | County_El Dorado County | County_Fresno County | County_Humboldt County | County_Imperial County | County_Kern County | County_Lake County | County_Los Angeles County | County_Marin County | County_Mendocino County | County_Merced County | County_Monterey County | County_Napa County | County_Orange County | County_Others | County_Placer County | County_Riverside County | County_Sacramento County | County_San Benito County | County_San Bernardino County | County_San Diego County | County_San Francisco County | County_San Joaquin County | County_San Luis Obispo County | County_San Mateo County | County_Santa Barbara County | County_Santa Clara County | County_Santa Cruz County | County_Shasta County | County_Siskiyou County | County_Solano County | County_Sonoma County | County_Stanislaus County | County_Trinity County | County_Tuolumne County | County_Ventura County | County_Yolo County |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Odds | 0.98868 | 1.016497 | 1.054641 | 2.071753 | 1.181706 | 5.314354 | 1.000828 | 0.423048 | 26.687698 | 0.574829 | 0.367699 | 0.777128 | 2.045823 | 0.794413 | 0.913437 | 0.721924 | 0.980117 | 1.789063 | 0.993820 | 0.975092 | 1.531687 | 0.951727 | 0.740574 | 1.034478 | 0.994449 | 0.899721 | 1.899527 | 1.917792 | 2.481574 | 1.040278 | 0.808279 | 0.461776 | 1.112146 | 1.320442 | 1.024713 | 0.657721 | 0.352475 | 1.111947 | 1.195350 | 1.267321 | 0.808504 | 0.967694 | 1.375253 | 1.650597 | 0.764036 | 0.898617 | 0.832220 | 1.226859 | 0.645239 |
| Change_odd% | -1.13196 | 1.649673 | 5.464055 | 107.175319 | 18.170595 | 431.435444 | 0.082837 | -57.695203 | 2568.769805 | -42.517148 | -63.230066 | -22.287154 | 104.582322 | -20.558742 | -8.656327 | -27.807556 | -1.988269 | 78.906304 | -0.618016 | -2.490783 | 53.168656 | -4.827257 | -25.942624 | 3.447842 | -0.555060 | -10.027898 | 89.952694 | 91.779178 | 148.157419 | 4.027802 | -19.172062 | -53.822409 | 11.214605 | 32.044211 | 2.471281 | -34.227931 | -64.752536 | 11.194726 | 19.535008 | 26.732067 | -19.149599 | -3.230602 | 37.525277 | 65.059701 | -23.596407 | -10.138273 | -16.778041 | 22.685949 | -35.476077 |
Age: holding all other features constant, a one-unit increase in Age multiplies the odds of a customer buying a personal loan by 0.99, i.e., a 1.13% decrease in the odds.
Experience: holding all other features constant, a one-unit increase in Experience multiplies the odds by 1.02, i.e., a 1.65% increase in the odds.
Family: holding all other features constant, a one-unit increase in Family multiplies the odds by 2.07, i.e., a 107.18% increase in the odds.
Education: holding all other features constant, a one-unit increase in Education multiplies the odds by 5.31, i.e., a 431.44% increase in the odds.
Securities_Account: holding all other features constant, having a securities account multiplies the odds by 0.42, i.e., a 57.70% decrease in the odds.
CD_Account: holding all other features constant, having a CD account multiplies the odds by 26.69, i.e., a 2568.77% increase in the odds.
Online: holding all other features constant, using online banking multiplies the odds by 0.57, i.e., a 42.52% decrease in the odds.
CreditCard: holding all other features constant, having a credit card multiplies the odds by 0.37, i.e., a 63.23% decrease in the odds.
The other attributes can be interpreted in the same way.
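The link between a coefficient and the odds multiplier can be verified directly: since the log-odds are linear in the features, a one-unit increase in a feature multiplies the odds by exp(coefficient), regardless of the starting value. A minimal numeric check with hypothetical coefficient values (not taken from the fitted model):

```python
import numpy as np

# In logistic regression the log-odds are linear in the features:
# log(odds) = b0 + b1*x, so odds = exp(b0 + b1*x), and a one-unit increase
# in x multiplies the odds by exp(b1), whatever the starting value of x.
b0, b1 = -13.24, 1.67  # hypothetical intercept and coefficient

def odds(x):
    return np.exp(b0 + b1 * x)

ratio = odds(3.0) / odds(2.0)  # odds ratio for a one-unit increase
print(np.isclose(ratio, np.exp(b1)))  # True
```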
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(lgModel, X_train, Y_train)
log_reg_model_train_perf = model_performance_classification_sklearn_with_threshold(
lgModel, X_train, Y_train
)
print("Training performance:")
log_reg_model_train_perf
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.954286 | 0.649547 | 0.830116 | 0.728814 |
logit_roc_auc_train = roc_auc_score(Y_train, lgModel.predict_proba(X_train)[:, 1])
fpr, tpr, thresholds = roc_curve(Y_train, lgModel.predict_proba(X_train)[:, 1])
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(Y_train, lgModel.predict_proba(X_train)[:, 1])
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
0.12381122722846054
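The argmax(tpr - fpr) rule above is Youden's J statistic. A small self-contained sketch on toy labels and scores (unrelated to the bank data) shows the mechanics without sklearn:

```python
import numpy as np

# Youden's J: among candidate thresholds, pick the one maximizing tpr - fpr.
# Toy labels and scores for illustration only.
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])
p = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.8, 0.9])

def tpr_fpr(threshold):
    pred = p >= threshold
    tpr = (pred & (y == 1)).sum() / (y == 1).sum()
    fpr = (pred & (y == 0)).sum() / (y == 0).sum()
    return tpr, fpr

cands = np.unique(p)
j = [t - f for t, f in (tpr_fpr(c) for c in cands)]
best = float(cands[int(np.argmax(j))])
print(best)  # 0.35
```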
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(
lgModel, X_train, Y_train, threshold=optimal_threshold_auc_roc
)
# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_sklearn_with_threshold(
lgModel, X_train, Y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.911714 | 0.882175 | 0.519573 | 0.653975 |
y_scores = lgModel.predict_proba(X_train)[:, 1]
prec, rec, tre = precision_recall_curve(Y_train, y_scores)
def plot_prec_recall_vs_thresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])
plt.figure(figsize=(10, 7))
plot_prec_recall_vs_thresh(prec, rec, tre)
plt.show()
# find the point where precision and recall meet
[i for i, j in zip(prec, rec) if i == j]
[0.7371601208459214]
[index for index, (p, r) in enumerate(zip(prec, rec)) if p == r]
[2043]
tre[2043]
0.3235660602196274
# setting the threshold
optimal_threshold_pr_curve = 0.32
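The 0.32 cutoff is the point where the two curves cross, i.e., where precision equals recall. As a sanity check of the definitions, precision and recall at a single cutoff can be computed by hand (toy labels and scores, not the model output):

```python
import numpy as np

# Precision and recall at one cutoff, straight from the definitions.
# Toy labels and scores for illustration only.
y = np.array([0, 0, 1, 0, 1, 1])
p = np.array([0.2, 0.4, 0.45, 0.6, 0.7, 0.9])
t = 0.5
pred = p >= t
tp = int((pred & (y == 1)).sum())
precision = tp / int(pred.sum())       # of predicted positives, share correct
recall = tp / int((y == 1).sum())      # of actual positives, share caught
print(precision, recall)  # equal at this cutoff: 2/3 each
```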
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(
lgModel, X_train, Y_train, threshold=optimal_threshold_pr_curve
)
log_reg_model_train_perf_threshold_curve = model_performance_classification_sklearn_with_threshold(
lgModel, X_train, Y_train, threshold=optimal_threshold_pr_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.949714 | 0.740181 | 0.731343 | 0.735736 |
# training performance comparison
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Logistic Regression sklearn",
"Logistic Regression-0.123 Threshold",
"Logistic Regression-0.32 Threshold",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Logistic Regression sklearn | Logistic Regression-0.123 Threshold | Logistic Regression-0.32 Threshold |
|---|---|---|---|
| Accuracy | 0.954286 | 0.911714 | 0.949714 |
| Recall | 0.649547 | 0.882175 | 0.740181 |
| Precision | 0.830116 | 0.519573 | 0.731343 |
| F1 | 0.728814 | 0.653975 | 0.735736 |
Using the model with default threshold
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(lgModel, X_test, Y_test)
log_reg_model_test_perf = model_performance_classification_sklearn_with_threshold(
lgModel, X_test, Y_test
)
print("Test set performance:")
log_reg_model_test_perf
Test set performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.945333 | 0.57047 | 0.825243 | 0.674603 |
ROC-AUC on test set
logit_roc_auc_test = roc_auc_score(Y_test, lgModel.predict_proba(X_test)[:, 1])
fpr, tpr, thresholds = roc_curve(Y_test, lgModel.predict_proba(X_test)[:, 1])
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_test)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
Using the model with threshold of 0.123
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(
lgModel, X_test, Y_test, threshold=optimal_threshold_auc_roc
)
# checking model performance for this model
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_sklearn_with_threshold(
lgModel, X_test, Y_test, threshold=optimal_threshold_auc_roc
)
print("Test set performance:")
log_reg_model_test_perf_threshold_auc_roc
Test set performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.908 | 0.85906 | 0.522449 | 0.649746 |
Using the model with threshold 0.32
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(
lgModel, X_test, Y_test, threshold=optimal_threshold_pr_curve
)
log_reg_model_test_perf_threshold_curve = model_performance_classification_sklearn_with_threshold(
lgModel, X_test, Y_test, threshold=optimal_threshold_pr_curve
)
print("Test performance:")
log_reg_model_test_perf_threshold_curve
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.945333 | 0.651007 | 0.76378 | 0.702899 |
# training performance comparison
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Logistic Regression sklearn",
"Logistic Regression-0.123 Threshold",
"Logistic Regression-0.32 Threshold",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Logistic Regression sklearn | Logistic Regression-0.123 Threshold | Logistic Regression-0.32 Threshold |
|---|---|---|---|
| Accuracy | 0.954286 | 0.911714 | 0.949714 |
| Recall | 0.649547 | 0.882175 | 0.740181 |
| Precision | 0.830116 | 0.519573 | 0.731343 |
| F1 | 0.728814 | 0.653975 | 0.735736 |
# testing performance comparison
models_test_comp_df = pd.concat(
[
log_reg_model_test_perf.T,
log_reg_model_test_perf_threshold_auc_roc.T,
log_reg_model_test_perf_threshold_curve.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Logistic Regression sklearn",
"Logistic Regression-0.123 Threshold",
"Logistic Regression-0.32 Threshold",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
| | Logistic Regression sklearn | Logistic Regression-0.123 Threshold | Logistic Regression-0.32 Threshold |
|---|---|---|---|
| Accuracy | 0.945333 | 0.908000 | 0.945333 |
| Recall | 0.570470 | 0.859060 | 0.651007 |
| Precision | 0.825243 | 0.522449 | 0.763780 |
| F1 | 0.674603 | 0.649746 | 0.702899 |
# Fit the model on train
model = LogisticRegression(solver="newton-cg", n_jobs=-1, random_state=1, max_iter=100)
print(f"Total features available is {X_train.columns.nunique()}")
Total features available is 49
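Sequential forward selection builds the feature set greedily: start from the empty set and, at each step, add the single remaining feature that most improves the cross-validated score. A minimal sketch of that greedy loop, using a toy additive score as a stand-in for CV recall (feature names and utilities are hypothetical):

```python
# Greedy forward selection, the strategy SFS applies:
# at each step, add the remaining feature that maximizes the score.
def forward_select(features, score, k):
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy per-feature utilities; the score of a set is just the sum.
utility = {"Income": 0.5, "Family": 0.3, "Age": 0.1}
sel = forward_select(utility, lambda s: sum(utility[f] for f in s), 2)
print(sel)  # ['Income', 'Family']
```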
# we will first build a model with all variables
sfs = SFS(
model,
k_features=49,
forward=True,
floating=False,
scoring="recall",
verbose=2,
cv=3,
n_jobs=-1,
)
sfs = sfs.fit(X_train, Y_train)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
Features: 1/49 -- score: 0.32645372645372644
Features: 2/49 -- score: 0.534916734916735
Features: 3/49 -- score: 0.5923286923286923
Features: 4/49 -- score: 0.6165165165165165
Features: 5/49 -- score: 0.6316134316134315
Features: 6/49 -- score: 0.6346164346164346
Features: 7/49 -- score: 0.6376194376194376
Features: 8/49 -- score: 0.6376194376194376
Features: 9/49 -- score: 0.6376194376194376
Features: 10/49 -- score: 0.6376467376467376
Features: 11/49 -- score: 0.6406770406770407
Features: 12/49 -- score: 0.6436800436800437
Features: 13/49 through 31/49 -- score: 0.6436800436800437
Features: 32/49 -- score: 0.6406770406770407
Features: 33/49 through 38/49 -- score: 0.6406497406497407
Features: 39/49 -- score: 0.6376194376194376
Features: 40/49 -- score: 0.6376194376194376
Features: 41/49 -- score: 0.6345891345891346
Features: 42/49 -- score: 0.6315588315588316
Features: 43/49 -- score: 0.6285558285558285
Features: 44/49 -- score: 0.6255801255801257
Features: 45/49 -- score: 0.6255255255255255
Features: 46/49 -- score: 0.6255528255528255
Features: 47/49 -- score: 0.6225225225225225
Features: 48/49 -- score: 0.6164619164619164
Features: 49/49 -- score: 0.6164619164619164
fig1 = plot_sfs(sfs.get_metric_dict(), kind="std_dev", figsize=(12, 5))
plt.ylim([0.3, 1])
plt.title("Sequential Forward Selection (w. StdDev)")
plt.xticks(rotation=90)
plt.show()
sfs1 = SFS(
model,
k_features=8,
forward=True,
floating=False,
scoring="recall",
verbose=2,
cv=3,
n_jobs=-1,
)
sfs1 = sfs1.fit(X_train, Y_train)
fig1 = plot_sfs(sfs1.get_metric_dict(), kind="std_dev", figsize=(10, 5))
plt.ylim([0.3, 1])
plt.title("Sequential Forward Selection (w. StdDev)")
plt.grid()
plt.show()
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
Features: 1/8 -- score: 0.32645372645372644
Features: 2/8 -- score: 0.534916734916735
Features: 3/8 -- score: 0.5923286923286923
Features: 4/8 -- score: 0.6165165165165165
Features: 5/8 -- score: 0.6316134316134315
Features: 6/8 -- score: 0.6346164346164346
Features: 7/8 -- score: 0.6376194376194376
Features: 8/8 -- score: 0.6376194376194376
feat_cols = list(sfs1.k_feature_idx_)
print(feat_cols)
[2, 3, 5, 8, 10, 11, 20, 29]
Let's look at the best 8 variables.
X_train.columns[feat_cols]
Index(['Income', 'Family', 'Education', 'CD_Account', 'CreditCard',
'County_Butte County', 'County_Marin County',
'County_Sacramento County'],
dtype='object')
X_train_final = X_train[X_train.columns[feat_cols]]
# Creating a new X_test with the same variables that we selected for X_train
X_test_final = X_test[X_train_final.columns]
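Indexing the test set by `X_train_final.columns` guarantees that prediction sees the same columns, in the same order, as training. A toy illustration of the same pattern (hypothetical frames and feature names):

```python
import pandas as pd

# After feature selection, the test set must carry the same columns,
# in the same order, as the training set. Toy frames for illustration.
X_tr = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
X_te = pd.DataFrame({"c": [9, 9], "a": [7, 8], "b": [0, 0]})

selected = ["a", "c"]                   # hypothetical selected features
X_tr_final = X_tr[selected]
X_te_final = X_te[X_tr_final.columns]   # align test columns to train
print(list(X_te_final.columns))  # ['a', 'c']
```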
logreg = LogisticRegression(
solver="newton-cg", penalty="none", verbose=True, n_jobs=-1, random_state=0
)
# There are several solvers available; here we use 'newton-cg'
# penalty="none" fits an unregularized model; max_iter is left at its default of 100
logreg.fit(X_train_final, Y_train)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers. [Parallel(n_jobs=-1)]: Done 1 out of 1 | elapsed: 0.1s finished
LogisticRegression(n_jobs=-1, penalty='none', random_state=0,
solver='newton-cg', verbose=True)
confusion_matrix_sklearn_with_threshold(logreg, X_train_final, Y_train)
log_reg_model_train_perf_SFS = model_performance_classification_sklearn_with_threshold(
logreg, X_train_final, Y_train
)
print("Training performance:")
log_reg_model_train_perf_SFS
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.953143 | 0.634441 | 0.83004 | 0.719178 |
confusion_matrix_sklearn_with_threshold(logreg, X_test_final, Y_test)
log_reg_model_test_perf_SFS = model_performance_classification_sklearn_with_threshold(
logreg, X_test_final, Y_test
)
print("Test set performance:")
log_reg_model_test_perf_SFS
Test set performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.945333 | 0.543624 | 0.852632 | 0.663934 |
# training performance comparison
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve.T,
log_reg_model_train_perf_SFS.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Logistic Regression sklearn",
"Logistic Regression-0.123 Threshold",
"Logistic Regression-0.32 Threshold",
"Logistic Regression - SFS",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Logistic Regression sklearn | Logistic Regression-0.123 Threshold | Logistic Regression-0.32 Threshold | Logistic Regression - SFS |
|---|---|---|---|---|
| Accuracy | 0.954286 | 0.911714 | 0.949714 | 0.953143 |
| Recall | 0.649547 | 0.882175 | 0.740181 | 0.634441 |
| Precision | 0.830116 | 0.519573 | 0.731343 | 0.830040 |
| F1 | 0.728814 | 0.653975 | 0.735736 | 0.719178 |
# testing performance comparison
models_test_comp_df = pd.concat(
[
log_reg_model_test_perf.T,
log_reg_model_test_perf_threshold_auc_roc.T,
log_reg_model_test_perf_threshold_curve.T,
log_reg_model_test_perf_SFS.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Logistic Regression sklearn",
"Logistic Regression-0.123 Threshold",
"Logistic Regression-0.32 Threshold",
"Logistic Regression - SFS",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
|   | Logistic Regression sklearn | Logistic Regression-0.123 Threshold | Logistic Regression-0.32 Threshold | Logistic Regression - SFS |
|---|---|---|---|---|
| Accuracy | 0.945333 | 0.908000 | 0.945333 | 0.945333 |
| Recall | 0.570470 | 0.859060 | 0.651007 | 0.543624 |
| Precision | 0.825243 | 0.522449 | 0.763780 | 0.852632 |
| F1 | 0.674603 | 0.649746 | 0.702899 | 0.663934 |
# defining a function to compute different metrics to check the performance of a classification model built using sklearn
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
ax = sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
ax.xaxis.set_label_position("top")
ax.xaxis.tick_top()
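The two helpers above wrap a handful of `sklearn.metrics` calls. A tiny self-contained check of what those calls return, on labels small enough to verify by hand:

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]  # one false positive, one false negative

print(confusion_matrix(y_true, y_pred))  # rows = true label, cols = predicted
print(accuracy_score(y_true, y_pred))    # 3 correct of 5 -> 0.6
print(recall_score(y_true, y_pred))      # 2 of 3 actual positives caught
print(precision_score(y_true, y_pred))   # 2 of 3 predicted positives correct
print(f1_score(y_true, y_pred))          # harmonic mean; 2/3 here since P == R
```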
# We will build our model using the DecisionTreeClassifier function, with the default 'gini' criterion to split.
dTreeModel = DecisionTreeClassifier(criterion="gini", random_state=1)
dTreeModel.fit(X_train, Y_train)
DecisionTreeClassifier(random_state=1)
decision_tree_perf_train = model_performance_classification_sklearn(
dTreeModel, X_train, Y_train
)
decision_tree_perf_train
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
confusion_matrix_sklearn(dTreeModel, X_train, Y_train)
decision_tree_perf_test = model_performance_classification_sklearn(
dTreeModel, X_test, Y_test
)
decision_tree_perf_test
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.976667 | 0.865772 | 0.895833 | 0.880546 |
confusion_matrix_sklearn(dTreeModel, X_test, Y_test)
column_names = list(X.columns)
feature_names = column_names
print(feature_names)
['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Education', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'County_Butte County', 'County_Contra Costa County', 'County_El Dorado County', 'County_Fresno County', 'County_Humboldt County', 'County_Imperial County', 'County_Kern County', 'County_Lake County', 'County_Los Angeles County', 'County_Marin County', 'County_Mendocino County', 'County_Merced County', 'County_Monterey County', 'County_Napa County', 'County_Orange County', 'County_Others', 'County_Placer County', 'County_Riverside County', 'County_Sacramento County', 'County_San Benito County', 'County_San Bernardino County', 'County_San Diego County', 'County_San Francisco County', 'County_San Joaquin County', 'County_San Luis Obispo County', 'County_San Mateo County', 'County_Santa Barbara County', 'County_Santa Clara County', 'County_Santa Cruz County', 'County_Shasta County', 'County_Siskiyou County', 'County_Solano County', 'County_Sonoma County', 'County_Stanislaus County', 'County_Trinity County', 'County_Tuolumne County', 'County_Ventura County', 'County_Yolo County']
from sklearn import tree  # provides plot_tree and export_text

plt.figure(figsize=(20, 30))
tree.plot_tree(
dTreeModel,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=True,
class_names=True,
)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(dTreeModel, feature_names=feature_names, show_weights=True))
(Output: the full rule text of the unpruned tree; the collapsed dump is summarized here for readability. The root split is on Income <= 116.50. The low-income branch splits further on CCAvg, Income, Family, Education, CD_Account, Experience, Age, and several county dummies, producing many small leaves. The high-income branch is simple: customers with Income > 116.50 are classified as loan purchasers if Education > 1.50 (weights [0, 222]) or if Education <= 1.50 and Family > 2.50 (weights [0, 47]), and as non-purchasers otherwise (weights [375, 0]). The post-pruned tree's much shorter rule text is printed in full later in this notebook.)
importances = dTreeModel.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
    "max_depth": list(np.arange(1, 8)) + [None],
    "criterion": ["entropy", "gini"],
    "splitter": ["best", "random"],
    "min_impurity_decrease": [0.000001, 0.00001, 0.0001],
}
# Type of scoring used to compare parameter combinations; we optimize for recall
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV

recall_scorer = make_scorer(recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=recall_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, Y_train)
# Pick the estimator with the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, Y_train)
DecisionTreeClassifier(min_impurity_decrease=1e-06, random_state=1)
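Beyond `best_estimator_`, `GridSearchCV` also exposes the winning parameter combination and its mean cross-validated score, which are worth inspecting. A self-contained sketch on synthetic data (the small grid here is illustrative, not the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, random_state=1)
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {"max_depth": [2, 4, None], "criterion": ["gini", "entropy"]},
    scoring="recall",  # the same metric the notebook optimizes
    cv=5,
).fit(X_demo, y_demo)

print(grid.best_params_)   # winning combination
print(grid.best_score_)    # mean cross-validated recall for that combination
```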
decision_tree_tune_perf_train = model_performance_classification_sklearn(
estimator, X_train, Y_train
)
decision_tree_tune_perf_train
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
confusion_matrix_sklearn(estimator, X_train, Y_train)
decision_tree_tune_perf_test = model_performance_classification_sklearn(
estimator, X_test, Y_test
)
decision_tree_tune_perf_test
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.976667 | 0.865772 | 0.895833 | 0.880546 |
confusion_matrix_sklearn(estimator, X_test, Y_test)
plt.figure(figsize=(15, 12))
tree.plot_tree(
estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=True,
class_names=True,
)
plt.show()
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, Y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path).T
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ccp_alphas | 0.0 | 0.000187 | 0.000269 | 0.000273 | 0.000274 | 0.000359 | 0.000381 | 0.000381 | 0.000381 | 0.000381 | 0.000381 | 0.000435 | 0.000476 | 0.000514 | 0.000578 | 0.000582 | 0.000607 | 0.000621 | 0.000811 | 0.001552 | 0.002333 | 0.003024 | 0.003294 | 0.006473 | 0.023866 | 0.056365 |
| impurities | 0.0 | 0.000562 | 0.001636 | 0.002182 | 0.004371 | 0.005447 | 0.005828 | 0.006209 | 0.006590 | 0.006971 | 0.007352 | 0.007787 | 0.008263 | 0.009805 | 0.012118 | 0.012701 | 0.013307 | 0.013928 | 0.017985 | 0.019536 | 0.021869 | 0.024893 | 0.028187 | 0.034659 | 0.058525 | 0.171255 |
fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.
clfs = []
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
clf.fit(X_train, Y_train)
clfs.append(clf)
print(
"Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.056364969335601575
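The effect of the penalty is monotone: a larger `ccp_alpha` charges more for tree size, so the fitted tree can only shrink. A self-contained illustration on synthetic data (the 0.02 alpha is arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=400, random_state=1)
full = DecisionTreeClassifier(random_state=1).fit(X_demo, y_demo)
pruned = DecisionTreeClassifier(random_state=1, ccp_alpha=0.02).fit(X_demo, y_demo)

# The pruned tree is a subtree of the full one, so it has at most as many nodes
print(full.tree_.node_count, pruned.tree_.node_count)
```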
For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
recall_train = []
for clf in clfs:
pred_train = clf.predict(X_train)
values_train = recall_score(Y_train, pred_train)
recall_train.append(values_train)
recall_test = []
for clf in clfs:
pred_test = clf.predict(X_test)
values_test = recall_score(Y_test, pred_test)
recall_test.append(values_test)
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
# selecting the model with the highest test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.0006209286209286216, random_state=1)
decision_tree_postpruned_perf_train = model_performance_classification_sklearn(
best_model, X_train, Y_train
)
decision_tree_postpruned_perf_train
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.991714 | 0.94864 | 0.96319 | 0.95586 |
confusion_matrix_sklearn(best_model, X_train, Y_train)
decision_tree_postpruned_perf_test = model_performance_classification_sklearn(
best_model, X_test, Y_test
)
decision_tree_postpruned_perf_test
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.982 | 0.885906 | 0.929577 | 0.907216 |
confusion_matrix_sklearn(best_model, X_test, Y_test)
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
best_model,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=True,
class_names=True,
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2553.00, 0.00] class: 0
|   |   |--- Income > 106.50
|   |   |   |--- Family <= 3.50
|   |   |   |   |--- weights: [63.00, 3.00] class: 0
|   |   |   |--- Family > 3.50
|   |   |   |   |--- Experience <= 3.50
|   |   |   |   |   |--- weights: [10.00, 0.00] class: 0
|   |   |   |   |--- Experience > 3.50
|   |   |   |   |   |--- Experience <= 34.00
|   |   |   |   |   |   |--- County_Santa Clara County <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 7.00] class: 1
|   |   |   |   |   |   |--- County_Santa Clara County > 0.50
|   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- Experience > 34.00
|   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- weights: [117.00, 10.00] class: 0
|   |   |   |--- CD_Account > 0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income > 92.50
|   |   |   |--- Education <= 1.50
|   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |--- weights: [33.00, 4.00] class: 0
|   |   |   |   |--- CD_Account > 0.50
|   |   |   |   |   |--- weights: [1.00, 5.00] class: 1
|   |   |   |--- Education > 1.50
|   |   |   |   |--- weights: [11.00, 28.00] class: 1
|--- Income > 116.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |--- Family > 2.50
|   |   |   |--- weights: [0.00, 47.00] class: 1
|   |--- Education > 1.50
|   |   |--- weights: [0.00, 222.00] class: 1
# Importance of features in the tree building (the importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature; also known as Gini importance)
print(
pd.DataFrame(
best_model.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
                                Imp
Education                  0.436448
Income                     0.324182
Family                     0.155849
CCAvg                      0.041142
CD_Account                 0.024692
Experience                 0.012037
County_Santa Clara County  0.005650
(all remaining features)   0.000000
importances = best_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# training performance comparison
models_train_comp_df = pd.concat(
[
decision_tree_perf_train.T,
decision_tree_tune_perf_train.T,
decision_tree_postpruned_perf_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree sklearn",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
|   | Decision Tree sklearn | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|
| Accuracy | 1.0 | 1.0 | 0.991714 |
| Recall | 1.0 | 1.0 | 0.948640 |
| Precision | 1.0 | 1.0 | 0.963190 |
| F1 | 1.0 | 1.0 | 0.955860 |
# test performance comparison
models_test_comp_df = pd.concat(
    [
        decision_tree_perf_test.T,
        decision_tree_tune_perf_test.T,
        decision_tree_postpruned_perf_test.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
|   | Decision Tree sklearn | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|
| Accuracy | 0.976667 | 0.976667 | 0.982000 |
| Recall | 0.865772 | 0.865772 | 0.885906 |
| Precision | 0.895833 | 0.895833 | 0.929577 |
| F1 | 0.880546 | 0.880546 | 0.907216 |
# training performance comparison
new_models_train_comp_df = pd.concat(
[
log_reg_model_train_perf_threshold_auc_roc.T,
decision_tree_postpruned_perf_train.T,
],
axis=1,
)
new_models_train_comp_df.columns = [
"Logistic Regression-0.123 Threshold",
"Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
new_models_train_comp_df
Training performance comparison:
|   | Logistic Regression-0.123 Threshold | Decision Tree (Post-Pruning) |
|---|---|---|
| Accuracy | 0.911714 | 0.991714 |
| Recall | 0.882175 | 0.948640 |
| Precision | 0.519573 | 0.963190 |
| F1 | 0.653975 | 0.955860 |
# test performance comparison
new_models_test_comp_df = pd.concat(
[
log_reg_model_test_perf_threshold_auc_roc.T,
decision_tree_postpruned_perf_test.T,
],
axis=1,
)
new_models_test_comp_df.columns = [
"Logistic Regression-0.123 Threshold",
"Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
new_models_test_comp_df
Test set performance comparison:
|   | Logistic Regression-0.123 Threshold | Decision Tree (Post-Pruning) |
|---|---|---|
| Accuracy | 0.908000 | 0.982000 |
| Recall | 0.859060 | 0.885906 |
| Precision | 0.522449 | 0.929577 |
| F1 | 0.649746 | 0.907216 |
Conclusion:
The post-pruned decision tree gives the best test performance of all the models compared: higher recall, precision, and F1 than the logistic regression models, and better generalization than the unpruned and pre-pruned trees, which fit the training data perfectly. We therefore recommend the Decision Tree (Post-Pruning) model for identifying potential loan customers.